For the following exercises, we will again use data on population from Gapminder.
As usual, we first need to read in the data. You can copy the following code into your script and run it.
library(readr)
library(dplyr)
gap_pop <- read_csv("../data/gapminder/population_total.csv") %>%
  rename(country = "Total population")
Again, the data are currently in wide format.
To select only the columns for the years 1900 to 1999, we can use the selection helper starts_with(). We also want to keep the country column.
gap_pop %>%
  select(country, starts_with("19"))
## # A tibble: 275 x 56
## country `1900` `1910` `1920` `1930` `1940` `1950` `1951` `1952`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abkhaz~ NA NA NA NA NA NA NA NA
## 2 Afghan~ 5021241 5351413 5813814 6394908 7034081 7752118 7839426 7934798
## 3 Akroti~ NA NA NA NA NA 10661 10737 10813
## 4 Albania 819950 901122 963956 1015991 1123210 1263171 1287499 1316086
## 5 Algeria 4946166 5404045 6063800 6876190 7797418 8872247 9039913 9216395
## 6 Americ~ 5949 7047 8173 10081 13135 18937 19295 19543
## 7 Andorra 4393 4671 4974 5309 5667 6197 6692 7250
## 8 Angola 2898155 3136718 3387663 3642200 3920011 4354882 4439705 4529381
## 9 Anguil~ 3561 3818 4097 4400 4725 5121 5297 5438
## 10 Antigu~ 34925 32119 30000 33647 38495 46301 48306 49887
## # ... with 265 more rows, and 47 more variables: `1953` <dbl>,
## # `1954` <dbl>, `1955` <dbl>, `1956` <dbl>, `1957` <dbl>, `1958` <dbl>,
## # `1959` <dbl>, `1960` <dbl>, `1961` <dbl>, `1962` <dbl>, `1963` <dbl>,
## # `1964` <dbl>, `1965` <dbl>, `1966` <dbl>, `1967` <dbl>, `1968` <dbl>,
## # `1969` <dbl>, `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>,
## # `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>,
## # `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>,
## # `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>,
## # `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>,
## # `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>,
## # `1999` <dbl>
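To see how starts_with() decides which columns to keep, it can help to try it on a tiny table. The following is a minimal sketch: toy_wide and its values are made up for illustration and are not Gapminder data. It also shows matches(), another selection helper, which takes a regular expression instead of a literal prefix.

```r
library(dplyr)

# Hypothetical toy table in wide format (values are made up)
toy_wide <- tibble(
  country = c("A", "B"),
  `1890` = c(1, 2),
  `1900` = c(3, 4),
  `1999` = c(5, 6),
  `2000` = c(7, 8)
)

# starts_with() keeps columns whose names begin with the literal prefix "19"
sel_prefix <- toy_wide %>%
  select(country, starts_with("19"))

# matches() does the same with a regular expression; "^19" anchors at the start
sel_regex <- toy_wide %>%
  select(country, matches("^19"))

names(sel_prefix)
# "country" "1900" "1999"
```

Both calls keep the same columns here. The columns 1890 and 2000 are dropped because their names do not begin with "19".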
As you may have already noticed, the dataset contains some missing values. Before we start analyzing the data, we might want to know for how many countries we have complete data.
To find out, we can use the drop_na() function from tidyr, which removes every row that contains at least one missing value.
library(tidyr)
gap_pop %>%
  drop_na() %>%
  nrow()
## [1] 229
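The counting logic can be checked on a small example. This is only a sketch with a made-up table (toy is hypothetical), comparing drop_na() with the base R function complete.cases() and showing that drop_na() can also be restricted to specific columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy table: only country "A" has values for both years
toy <- tibble(
  country = c("A", "B", "C"),
  `1900` = c(1, NA, 3),
  `1910` = c(4, 5, NA)
)

# Number of rows without any missing value
n_complete <- toy %>%
  drop_na() %>%
  nrow()

# Base R equivalent: complete.cases() flags rows without missing values
n_complete_base <- sum(complete.cases(toy))

# Only require the 1900 column to be non-missing
n_complete_1900 <- toy %>%
  drop_na(`1900`) %>%
  nrow()

n_complete       # 1
n_complete_base  # 1
n_complete_1900  # 2
```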
As in the previous set of data wrangling exercises, we now want to transform the data into the long format.
The new year column should be of type integer, which we can achieve with mutate().
gap_pop <- gap_pop %>%
  gather(-country, key = "year", value = "pop") %>%
  mutate(year = as.integer(year))
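The tidyr documentation recommends pivot_longer() over gather() for new code. As a sketch, here is how the same reshaping might look with both functions on a made-up table (toy_wide is hypothetical). The two results contain the same rows, but in a different order, so we sort them before comparing.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy table in wide format (values are made up)
toy_wide <- tibble(
  country = c("A", "B"),
  `1900` = c(1, 2),
  `1910` = c(3, 4)
)

# Reshape with gather(), as in the solution above
long_gather <- toy_wide %>%
  gather(-country, key = "year", value = "pop") %>%
  mutate(year = as.integer(year)) %>%
  arrange(country, year)

# Same reshaping with pivot_longer()
long_pivot <- toy_wide %>%
  pivot_longer(-country, names_to = "year", values_to = "pop") %>%
  mutate(year = as.integer(year)) %>%
  arrange(country, year)

long_pivot
```

After sorting, both data frames have four rows with the columns country, year and pop.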
Now let’s apply some of the advanced filtering options we discussed in the Data Wrangling - Part 2 session.
Create two new data frames that include different subsets of the gap_pop data:
Data for all countries for the 1990s (name this one gap_pop_1990s),
Data for all years but only for Germany (name this one gap_pop_ger).
You can use a helper function from dplyr to create the first new data frame and a specific matching operator to create the second one.
gap_pop_1990s <- gap_pop %>%
  filter(between(year, 1990, 1999))
gap_pop_1990s
## # A tibble: 2,750 x 3
## country year pop
## <chr> <int> <dbl>
## 1 Abkhazia 1990 NA
## 2 Afghanistan 1990 12067570
## 3 Akrotiri and Dhekelia 1990 14127
## 4 Albania 1990 3281453
## 5 Algeria 1990 25912364
## 6 American Samoa 1990 47044
## 7 Andorra 1990 54511
## 8 Angola 1990 11127870
## 9 Anguilla 1990 8334
## 10 Antigua and Barbuda 1990 61906
## # ... with 2,740 more rows
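between() is a shortcut for two comparisons and includes both boundaries. The following sketch checks this on a made-up data frame (toy_long is hypothetical).

```r
library(dplyr)

# Hypothetical toy data in long format (values are made up)
toy_long <- tibble(
  country = "A",
  year = c(1989L, 1990L, 1995L, 1999L, 2000L),
  pop = c(10, 11, 12, 13, 14)
)

# between() keeps values with 1990 <= year <= 1999
with_between <- toy_long %>%
  filter(between(year, 1990, 1999))

# The same filter written as two explicit comparisons
with_comparisons <- toy_long %>%
  filter(year >= 1990, year <= 1999)

with_between$year
# 1990 1995 1999
```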
gap_pop_ger <- gap_pop %>%
  filter(country %in%
           c("Germany", "West Germany", "East Germany"))
gap_pop_ger
## # A tibble: 243 x 3
## country year pop
## <chr> <int> <dbl>
## 1 East Germany 1800 NA
## 2 Germany 1800 22886919
## 3 West Germany 1800 NA
## 4 East Germany 1810 NA
## 5 Germany 1810 23882461
## 6 West Germany 1810 NA
## 7 East Germany 1820 NA
## 8 Germany 1820 25507768
## 9 West Germany 1820 NA
## 10 East Germany 1830 NA
## # ... with 233 more rows
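It is worth seeing why %in% is used here rather than ==. With a vector on the right-hand side, == compares element by element (recycling the shorter vector), whereas %in% checks each value for membership in the set. A small sketch with made-up data (toy_countries and targets are hypothetical names):

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy_countries <- tibble(
  country = c("East Germany", "France", "Germany",
              "West Germany", "Italy", "Germany"),
  year = 1990L
)

targets <- c("Germany", "West Germany", "East Germany")

# %in% asks, for each row: is this country one of the targets?
n_in <- toy_countries %>%
  filter(country %in% targets) %>%
  nrow()

# == compares position by position against the recycled targets vector,
# so matches depend on row order and are silently missed
n_eq <- toy_countries %>%
  filter(country == targets) %>%
  nrow()

n_in  # 4
n_eq  # 0
```

In this example the == version returns no rows at all, without any error or warning.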
For some comparisons (especially via plots), it might help to know which continent the country is located on. For this purpose, we will create a new continent variable. As it would be quite tedious to create this variable manually for all of the countries in the dataset, we will do this only for a subset in this exercise. Just run the following code in your local script to create this subset.
gap_pop_subset <- gap_pop %>%
  filter(country %in%
           c("Netherlands", "Brazil", "China", "Algeria", "New Zealand"))
We can use recode_factor() from dplyr to create the new variable. Alternatively, you could also use case_when() here. However, the latter would require more typing, which is something that we generally want to avoid.
gap_pop_subset %>%
  mutate(continent = recode_factor(country,
                                   "Algeria" = "Africa",
                                   "Brazil" = "Americas",
                                   "China" = "Asia",
                                   "Netherlands" = "Europe",
                                   "New Zealand" = "Oceania"
  ))
## # A tibble: 405 x 4
## country year pop continent
## <chr> <int> <dbl> <fct>
## 1 Algeria 1800 2503218 Africa
## 2 Brazil 1800 3639636 Americas
## 3 China 1800 321675013 Asia
## 4 Netherlands 1800 2254522 Europe
## 5 New Zealand 1800 100000 Oceania
## 6 Algeria 1810 2595056 Africa
## 7 Brazil 1810 4058652 Americas
## 8 China 1810 350542958 Asia
## 9 Netherlands 1810 2293548 Europe
## 10 New Zealand 1810 100000 Oceania
## 10 New Zealand 1810 100000 Oceania
## # ... with 395 more rows
# alternative solution using case_when
# gap_pop_subset %>%
# mutate(continent = factor(case_when(
# country == "Algeria" ~ "Africa",
# country == "Brazil" ~ "Americas",
# country == "China" ~ "Asia",
# country == "Netherlands" ~ "Europe",
# country == "New Zealand" ~ "Oceania")
# ))
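One detail to keep in mind when recoding: values that are not listed in recode_factor() are left unchanged unless a .default is supplied. The following sketch illustrates this with a made-up vector (toy_country and the label "Other" are chosen for illustration only).

```r
library(dplyr)

# Hypothetical toy vector; "Netherlands" is deliberately not recoded below
toy_country <- c("Algeria", "Brazil", "Netherlands")

# Without .default, the unmatched value is carried over as it is
kept <- recode_factor(toy_country,
                      "Algeria" = "Africa",
                      "Brazil" = "Americas")

# With .default, every unmatched value gets the same fallback label
other <- recode_factor(toy_country,
                       "Algeria" = "Africa",
                       "Brazil" = "Americas",
                       .default = "Other")

as.character(kept)
# "Africa" "Americas" "Netherlands"
as.character(other)
# "Africa" "Americas" "Other"
```

Checking the levels of the new variable, for example with levels() or count(), is a quick way to spot values that were missed.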